Author - Dhaval Javia
Hello, my name is Dhaval Javia. I am from India and currently work at Infosys. I have been thinking about moving abroad, and the first thing you search for when you arrive is a house to stay in (of course food comes first, but for now let's keep the hunger aside). I want a house on rent that I can afford, so initially price is the main factor for me. After I settle in the country, I can look at buying a house, so I need some sort of system where I can search and compare house rent prices by neighborhood.
By the end of this notebook, we should be able to estimate housing prices per neighborhood based on features, location, and other dependent parameters. The main target audience is a person who has been living in the area for some time and now wants a house of his/her own in a neighborhood of choice.
Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.
The following data will be used to solve the problem at hand.
We have data available from various sources, but I found this houses-on-rent price dataset from the sources below, which required a lot of data-scraping skill, time, and processing power. It has all the data available: latitude, longitude, street name, apartment name, rent price, bathrooms, and bedrooms. For extra features, we will use the Foursquare API for nearby popular locations.
Now, first of all, fetching data from the website. Yeah, it's not an easy task, especially when there are 5 home groups on a single webpage, each with its own link, and each group has multiple housing options by features and price. Phew!
Now we have ApartmentName, pricing per NoofBedrooms and NoofBathrooms, and location and neighborhood data from the Bing API.
We will first take a look at the acquired data, remove everything unnecessary, and clean some of the columns, as they contain invalid data. For example, the pricing column contains some irrelevant string data, and some of the rows lack neighborhood or location data. Even where it is present, the API sometimes failed to geocode properly and returned a generic "United States" location, which is of no use to us.
After cleaning the data, we will visualize the dependent and independent variables and see whether those variables have any effect on pricing.
After deciding which variables will be useful for modeling, we will group pricing by neighborhood and house features, then use KMeans to place similar neighborhoods in the same cluster. We will use k = 10 clusters to gain some flexibility and give the user more choice.
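As a rough sketch of that clustering step (the column names and numbers here are made up for illustration; the real per-neighborhood averages are built later in this notebook):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy stand-in for the per-neighborhood averages built later in the notebook
# (column names and values here are made up for illustration).
toy = pd.DataFrame({
    'AvgPrice':    [1200, 1250, 2400, 2500, 3900, 4000],
    'AvgBedrooms': [1.0,  1.2,  2.0,  2.1,  3.0,  3.1],
})

# k = 3 here only because the toy frame has 6 rows; the notebook itself uses k = 10
kcluster = KMeans(n_clusters=3, n_init=10, random_state=0).fit(toy)
print(kcluster.labels_)  # one cluster label per "neighborhood"
```

Each row gets a cluster label, and neighborhoods with similar average prices land in the same cluster; that is exactly what we will do, just with more features and k = 10.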
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
from scipy import stats
%matplotlib inline
!pip install geocoder
!pip install selenium
!pip install geopy
!pip install pgeocode
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import requests
import urllib.request
import time
!pip install folium
import json
import re
from geopy.extra.rate_limiter import RateLimiter
import geopy.geocoders as geocoders
from geopy.geocoders import Nominatim
# # fetching responses for all 260 webpages to build the database
# all_web_pages = {}
# for i in range(0, 260):
#     url = 'https://www.torontorentals.com/toronto?p=' + str(i+1)
#     print(url)
#     all_web_pages[i] = requests.get(url)
# # saving each webpage response in an HTML file for future use
# for i in range(0, 260):
#     filename = "./htmls/" + str(i) + '.html'
#     print(filename)
#     with open(filename, "w", encoding="utf-8") as f:
#         f.write(all_web_pages[i].text)
Alright, the tricky part: processing the HTML files and fetching the data inside the SCRIPT tags. We use BeautifulSoup to get the content inside each tag, and the URL found in that content to fetch the various configurations and prices.
Once we have looped all the required data into separate lists, we create a dataframe and save its contents to a CSV. The code below takes a long time to run, as it visits every URL and extracts the required data; it took about 2 hours to fetch and format all 3000 rows. So we have commented it out to prevent it from running again.
We can save data to a dataframe and then export it to CSV. We will use that CSV for later use and rerun of project.
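The core of that extraction, shown here on a minimal made-up HTML snippet rather than a live page, is just: find the `<script>` tags, parse their JSON body, and read the fields we need:

```python
import json
from bs4 import BeautifulSoup

# A made-up snippet mimicking the structure of the listing pages (not real data)
html = """
<html><head>
<script type="application/ld+json">
{"name": "Example Apartments",
 "address": {"streetAddress": "123 Example St"},
 "url": "https://example.com/listing/123"}
</script>
</head><body></body></html>
"""

soap = BeautifulSoup(html, "html.parser")
for tag in soap.find_all("script"):
    listing = json.loads(tag.text)  # each script tag's body here is plain JSON
    print(listing['name'], '|', listing['address']['streetAddress'])
```

The real pages need the slicing and retry logic in the commented cell below, but the parsing pattern is the same.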
# from os import listdir
# # Listing columns for the dataframe and creating lists to hold the data in case the code fails midway.
# # We can use these lists to append data to the dataframe and export it.
# apartname = []
# latitude1 = []
# longi = []
# bedrms = []
# bathrms = []
# priceL = []
# streetnames = []
# # iterating through each saved HTML file
# for each in listdir("./htmls/"):
#     filename = './htmls/' + each
#     with open(filename, "r", encoding="utf-8") as f:
#         response1 = f.read()  # contents of one saved page
#     # using BeautifulSoup to find the script tags in the HTML content
#     soap1 = BeautifulSoup(response1, "html.parser")
#     script_dump = soap1.find_all("script")
#     # The slice 3:-6 was found by testing: all the necessary elements fall in that range.
#     # (The -1 element contains all the location and URL info we could have used, but Meh. Ma Project, Ma rules!!!)
#     # iterating through each script tag
#     for group in script_dump[3:-6]:
#         test2 = BeautifulSoup(group.text, "html.parser")
#         # extracting the necessary fields: name, street address, etc.
#         newDictionary = json.loads(test2.text)
#         NameofApartment = newDictionary['name']
#         streetAddress = newDictionary['address']['streetAddress']
#         print('---------------Apartment Name:- ', NameofApartment, "------------------")
#         # fetching the group's own page for price, apartment type, number of bathrooms, etc.
#         response2 = requests.get(newDictionary['url'])
#         sub_soap = BeautifulSoup(response2.text, "html.parser")
#         price_lst = sub_soap.find_all('td', {"class": "price"})
#         beds_lst = sub_soap.find_all('td', {"class": "beds"})
#         baths_lst = sub_soap.find_all('td', {"class": "baths"})
#         print(len(price_lst))
#         for i in range(0, len(price_lst)):
#             beds = beds_lst[i].text.replace('\n', '')
#             bath = baths_lst[i].text.replace('\n', '')
#             price = price_lst[i].text.replace('\n', '')
#             print("No of bedrooms are {}, No of bathrooms are {}, Price is {}".format(beds, bath, price))
#             apartname.append(NameofApartment)
#             streetnames.append(streetAddress)
#             priceL.append(price)
#             bedrms.append(beds)
#             bathrms.append(bath)
# # appending the collected lists to a dataframe and saving it to a file
# columns = ['ApartmentName', 'Streetname', 'NoofBedrooms', 'NoofBathrooms', 'Price']
# df = pd.DataFrame(columns=columns)
# df['Price'] = priceL
# df['ApartmentName'] = apartname
# df['NoofBathrooms'] = bathrms
# df['NoofBedrooms'] = bedrms
# df['Streetname'] = streetnames
# df.head()
#df.to_csv('./raw_df.csv')
import types
import pandas as pd
from botocore.client import Config
import ibm_boto3
def __iter__(self): return 0
# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.
client_43833077c8684b8d83c91e7fabcbb244 = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='***REMOVED***',  # credential redacted before sharing
    ibm_auth_endpoint="https://iam.ng.bluemix.net/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3-api.us-geo.objectstorage.service.networklayer.com')
body = client_43833077c8684b8d83c91e7fabcbb244.get_object(Bucket='ibmdscapstoneproject-donotdelete-pr-345qkkpanezvxh',Key='raw_df.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )
df = pd.read_csv(body)
df.head()
Reading the CSV we saved in the previous step and removing the unnamed index column.
#df = pd.read_csv('./raw_df.csv',index_col=None)
df.drop('Unnamed: 0',inplace=True,axis=1) #dropping unnamed column from df
df.reset_index(drop=True)
df.head()
df.shape
Finding the neighborhood based on the street address, and appending failed lookups to a list for later use. After running this code, it turned out that some neighborhood data was incorrect; the next step was used to fix some of those and fill in missing neighborhoods.
# df['Neighborhood'] = ''
# ignored = []
# for Streetname in df.Streetname.unique():
#     try:
#         url = 'http://dev.virtualearth.net/REST/v1/Locations/CA/-/-/-/{}?&includeNeighborhood=1&key=YOUR_BING_MAPS_KEY'
#         url = url.format(Streetname.split(' | ')[0].replace(' ', '%20'))
#         response = requests.get(url)
#         response = response.json()
#         print(response['resourceSets'][0]['resources'][0]['address']['neighborhood'])
#         print(response['resourceSets'][0]['resources'][0]['point']['coordinates'][0])
#         print(response['resourceSets'][0]['resources'][0]['point']['coordinates'][1])
#         print('')
#         df['Neighborhood'][df['Streetname'] == Streetname] = response['resourceSets'][0]['resources'][0]['address']['neighborhood']
#         df['Latitude'][df['Streetname'] == Streetname] = response['resourceSets'][0]['resources'][0]['point']['coordinates'][0]
#         df['longitude'][df['Streetname'] == Streetname] = response['resourceSets'][0]['resources'][0]['point']['coordinates'][1]
#     except:
#         ignored.append(Streetname)
# # retrying the failed lookups, falling back to the locality when no neighborhood is returned
# responses = []
# for each in df[df.Latitude.isnull()].Streetname.unique():
#     print(each.split(' | ')[0])
#     try:
#         url = 'http://dev.virtualearth.net/REST/v1/Locations/CA/-/-/-/{}?&includeNeighborhood=1&key=YOUR_BING_MAPS_KEY'
#         url = url.format((each.split(' | ')[0]).replace(' ', '%20'))
#         response = requests.get(url)
#         response = response.json()
#         latitude = response['resourceSets'][0]['resources'][0]['point']['coordinates'][0]
#         neighborhood = response['resourceSets'][0]['resources'][0]['address']['locality']
#         longitude = response['resourceSets'][0]['resources'][0]['point']['coordinates'][1]
#         df['Neighborhood'][df['Streetname'] == each] = neighborhood
#         df['Latitude'][df['Streetname'] == each] = latitude
#         df['longitude'][df['Streetname'] == each] = longitude
#     except:
#         responses.append(response)
# df.dropna(axis=0, inplace=True)
# df.head()
# df.to_csv('./raw_df.csv')
OK, now our data looks like this.
# @hidden_cell
# response = requests.get('https://www.torontorentals.com/toronto')
# soap = BeautifulSoup(response.text,"html.parser")
# soap1 = soap.find_all('script')
# soap1 = soap1[-1]
# soap1.text
# import re
# f1 = re.sub("\n \n \n \n \n \n\n ",'', soap1.text)
# f1 = re.sub("\n const markers = \[\];","",f1)
# f1 = re.sub("\n ","",f1)
# f1 = re.sub("\n ",'',f1)
# f1 = re.sub(' ','',f1)
# f1 = re.sub('\n\n','',f1)
# f1 = re.sub('markers.push\(','',f1)
# f1 = re.sub('\n','',f1)
# f1 = re.sub('\)','',f1)
# f2 = re.split(r'\;',f1)
# street_loc_dict = {}
# for i in range(0, len(f2)):
#     try:
#         a = f2[i]
#         streetname = re.sub("\'", "", re.sub("\"", "", re.findall(r'street:\s(.*?),', a)[0]))
#         name = re.sub("\'", "", re.sub("\"", "", re.findall(r'name:\s(.*?),', a)[0]))
#         lat = re.sub("\'", "", re.sub("\"", "", re.findall(r'lat: (.*?),', a)[0]))
#         lng = re.sub("(.*?)\,lng: ", "", re.findall(r'lat: (.*?)\}', a)[0])
#         street_loc_dict[streetname] = [lat, lng]
#         print(streetname)
#         print(street_loc_dict[streetname])
#     except:
#         pass
# @hidden_cell
# tdf = pd.DataFrame(street_loc_dict.items())
# tdf.rename(columns = {0:'Streetname',1:'location'},inplace=True)
# tdf.head()
# tdf = pd.DataFrame(data=tdf.location.to_list(),index=tdf.Streetname)
# tdf.rename(columns = {0:'Latitude',1:'longitude'},inplace=True)
# tdf.reset_index(inplace=True)
# #tdf[tdf['Streetname'] == '15 Roehampton Ave']
# tdf.to_csv('./test.csv')
# # df_new1 = pd.merge(df_new,tdf,on='Streetname',how='outer')
# # # # df_new1.drop(['Latitude_x','Longitude'],inplace=True,axis = 1)
# # # # df_new1.rename(columns={'Latitude_y':'Latitude'},inplace=True)
# # df_new1.head()
# @hidden_cell
# response = requests.get('https://www.torontorentals.com/toronto')
# soap = BeautifulSoup(response.text,"html.parser")
# script_dump = soap.find_all('script')
# #script_dump[-1]
# @hidden_cell
# f1 = re.sub("\n \n \n \n \n \n\n ",'', script_dump[-1].text)
# f1 = re.sub("\n const markers = \[\];","",f1)
# f1 = re.sub("\n ","",f1)
# f1 = re.sub("\n ",'',f1)
# f1 = re.sub(' ','',f1)
# f1 = re.sub('\n\n','',f1)
# f1 = re.sub('markers.push\(','',f1)
# f1 = re.sub('\n','',f1)
# f1 = re.sub('\)','',f1)
# f2 = re.split(r'\;',f1)
# name_of_apt_dict = {}
# for i in range(0, len(f2)):
#     try:
#         name = re.sub("\'", "", re.sub("\"", "", (re.findall(r'name:\s(.*?),', f2[i])[0])))
#         streetname = re.sub("\'", "", re.sub("\"", "", (re.findall(r'street:\s(.*?),', f2[i])[0])))
#         name_of_apt_dict[str(name)] = str(streetname.split('|')[0])
#         # print("name of apartment:- {}".format(name))
#         # print('streetname:- {}'.format(streetname))
#     except:
#         pass
df.head()
We now have ApartmentName, the apartment configuration (NoofBedrooms, NoofBathrooms, Price), plus Streetname, location coordinates, and the Neighborhood extracted from the street name.
Now, what is the most important data here? As per my understanding, it is the neighborhood and the average pricing based on features. We will see how this is used as we progress.
The dependent variable (Price) increases with bedrooms and bathrooms (with a stronger effect from the bedrooms variable). We need to confirm this with a correlation analysis.
import folium
toronto_map = folium.Map(location=[43.6532,-79.3832],zoom_start=12)
toronto_map.fit_bounds([[43.581028, -79.542861],[43.855465, -79.170700]])
toronto_map
# for lat, lon, street, apt in zip(df['Latitude'], df['longitude'], df['Streetname'], df['ApartmentName']):
#     try:
#         label = folium.Popup(apt, parse_html=True)
#         folium.CircleMarker(
#             location=[lat, lon],
#             color='red',
#             radius=5,
#             popup=label,
#             parse_html=True).add_to(toronto_map)
#     except:
#         pass
# display(toronto_map)
df.dtypes
Ah, see... I knew it. The data type of Price is object here and we want float. Also, if you look at the Price column, it has a $ sign and a , separator, which we do not want. So we will go ahead and clean it.
df.replace({'Price':'\$'},{'Price':''},regex=True,inplace=True)
df.replace({'Price':','},{'Price':''},regex=True,inplace=True)
df = df[df['Price'] != 'Inquire']  # drop rows with no listed price before converting
df['Price'] = df.Price.astype('float64')
df.dtypes
df.Price.describe()
This min value of 1 seems incorrect. (Not seems, it is incorrect. Who rents a house at $1? Was the owner on weed!?) Anyway, we will go ahead and remove this value and see what the min is after that.
df[df['Price'] == 1]
Oh, just 2 rows; it looks like some sort of error in data gathering. We can remove them rather than go back and re-fetch the values.
df = df[df['Price'] != 1]
df.Price.describe()
Hmm, something is still wrong. Oh wait, the max rent. Oh my God, 19500!? Are you kidding me? Does Harvey Specter live here or what? What is this, Suits?
df[df['Price'] == 19500]
We will remove it, of course. Another data-entry error, it seems.
df = df[df['Price'] != 19500]
df.Price.describe()
We will plot graphs to see the outliers.
sns.distplot(df['Price'])
We can definitely see some Suits guy living in a wealthy apartment with rent above $8k per month. Let's see how many houses are above $8k rent.
df[df['Price'] > 8000]
df = df[df['Price'] <= 8000]
df.Price.describe()
sns.distplot(df['Price'])
Ah, much better. Now, if you Google enough and know a thing or two about statistics, you will know about skewness. There are two kinds: positive and negative.
Positively skewed data: if the tail is on the right, as in the second image of the figure, the data is right-skewed, also called positively skewed. Common transformations of this data include square root, cube root, and log.
Negatively skewed data: if the tail is on the left, the data is left-skewed, also called negatively skewed. Common transformations include square, cube, and exponential.
Another method of handling skewness is finding outliers and possibly removing them.
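For example, a common way to flag such outliers automatically is the interquartile-range (IQR) rule. This is a generic sketch on synthetic prices (with two deliberate outliers echoing the $1 and $19500 rows we removed by hand), not something applied to our dataframe here:

```python
import pandas as pd

# Synthetic rent prices with two deliberate outliers (values are made up)
prices = pd.Series([1, 950, 1100, 1200, 1300, 1450, 1600, 1800, 2100, 19500])

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# keep only values within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
clean = prices[(prices >= lower) & (prices <= upper)]
print(clean.tolist())  # the 1 and 19500 rows are flagged and dropped
```

The 1.5 multiplier is the conventional choice; a larger multiplier keeps more of the tail.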
Here, what we have is positively skewed data. We can probably use the log transform to fix it. Let's find out whether that works.
def normalize(column):
    # simple min-max normalization to the [0, 1] range
    upper = column.max()
    lower = column.min()
    y = (column - lower) / (upper - lower)
    return y
price_norm = normalize(df.Price)
sns.distplot(price_norm,fit=norm)
fig = plt.figure()
res = stats.probplot(df['Price'], plot=plt)
sns.distplot(np.log(df.Price),fit=norm)
fig = plt.figure()
res = stats.probplot(np.log(df['Price']), plot=plt)
sns.distplot(np.log10(df.Price),fit=norm)
fig = plt.figure()
res = stats.probplot(np.log10(df['Price']), plot=plt)
sns.distplot(np.power(df.Price,1/3),fit=norm)
fig = plt.figure()
res = stats.probplot(np.power(df['Price'],1/3), plot=plt)
Ah, that looks perfect! Isn't she a beauty?
We tried log, log10, and min-max normalization of the data, but those did not work. What worked for us was the cube root of the data (it often works for highly skewed data).
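We can also quantify this instead of eyeballing the plots: `scipy.stats.skew` gives one number per transform. A sketch on synthetic right-skewed data (log-normal, so all values are positive, which the log and cube-root transforms require):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
prices = rng.lognormal(mean=7, sigma=0.8, size=5000)  # synthetic right-skewed "prices"

print('raw      :', round(skew(prices), 2))            # strongly positive
print('log      :', round(skew(np.log(prices)), 2))    # close to 0
print('cube root:', round(skew(np.cbrt(prices)), 2))   # far smaller than raw
```

A skewness near 0 means roughly symmetric; on our real Price column the cube root gave the best-looking probability plot.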
corrmat = df.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True);
Ah, as we predicted, NoofBathrooms and NoofBedrooms are highly correlated with Price.
#price correlation matrix
k = 5 #number of variables for heatmap
cols = corrmat.nlargest(k, 'Price')['Price'].index
cm = np.corrcoef(df[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()
sns.set()
cols = ['Price', 'NoofBedrooms', 'NoofBathrooms']
sns.pairplot(df[cols], height=2.5)  # 'size' was renamed to 'height' in newer seaborn
plt.show();
#missing data
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
Great! No missing data, as it was already taken care of earlier.
df.head()
Filtering the data to a bounding box of map points; rows falling outside will be omitted.
df = df[df['Latitude'] <= 44]
df = df[df['Latitude'] >= 43]
df.describe()
df = df[df['longitude'] > -80]
df = df[df['longitude'] < -79]
df.describe()
# df_toronto = df[df['Latitude'] >= 43.713689]
# df_toronto = df_toronto[df_toronto['Latitude'] <= 43.855465]
df = df[df['longitude'] >= -79.63967]
df = df[df['longitude'] <= -79.092422]
df.describe()
df[df['Neighborhood'] == 'Rockcliffe-Smythe']
# averaging features per neighborhood (note: these columns are numeric, so get_dummies leaves them unchanged)
df_onehot = pd.get_dummies(df[['NoofBedrooms','NoofBathrooms','Price']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
df_onehot['Neighborhood'] = df['Neighborhood']
df_onehot = df_onehot.groupby('Neighborhood').mean().reset_index()
Creating a new dataframe with the necessary data and the average price per neighborhood.
# df_new = df[['Neighborhood','Price','Latitude','longitude','NoofBedrooms','NoofBathrooms']]
# df_new1 = df_new.groupby(['Neighborhood','NoofBedrooms','NoofBathrooms']).mean().reset_index()
# # df_new1 = df_new1.sort_values(ascending=True,by='Price').reset_index()
# # df_new1 = df_new1.drop('index',1)
# df_new1
# for lat, lon, street, apt in zip(df_toronto['Latitude'], df_toronto['longitude'], df_toronto['Streetname'], df_toronto['ApartmentName']):
#     try:
#         label = folium.Popup(apt, parse_html=True)
#         folium.CircleMarker(
#             location=[lat, lon],
#             color='blue',
#             radius=3,
#             popup=label,
#             parse_html=True, clustered_marker=True).add_to(toronto_map)
#     except:
#         pass
# # toronto_map.fit_bounds([[43.749909, -79.639678],[43.581028, -79.542861],[43.855465, -79.170700],[43.713689, -79.092422]])
# display(toronto_map)
from sklearn.cluster import KMeans
clust = 10
df_toronto_cluster = df_onehot.drop('Neighborhood', axis=1)
kcluster = KMeans(n_clusters=clust, random_state=0).fit(df_toronto_cluster)
clusters = kcluster.labels_
len(clusters)
df_onehot
df_onehot.insert(1,'Cluster',kcluster.labels_)
df_new1 = df_onehot.rename(columns={'Price':'AvgPrice','NoofBedrooms':'AvgBedrooms','NoofBathrooms':'AvgBathrooms'})
#df_new1 = df_new1.drop(['Latitude','longitude','NoofBedrooms','NoofBathrooms'],axis = 1)
df_new1.head()
df_copy = df
df_copy = df_copy.join(df_new1.set_index('Neighborhood'),on='Neighborhood')
df_copy.head()
# set color scheme for the clusters
import matplotlib.cm as cm
import matplotlib.colors as colors
x = np.arange(clust)
ys = [i + x + (i*x)**2 for i in range(clust)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add markers to the map
markers_colors = []
for lat, lon, poi, cluster, price, beds, baths in zip(df_copy['Latitude'], df_copy['longitude'], df_copy['Neighborhood'], df_copy['Cluster'], df_copy['AvgPrice'], df_copy['AvgBedrooms'], df_copy['AvgBathrooms']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster) + ' AvgPrice ' + str(price) + ' with avg beds ' + str(beds) + ' and avg baths ' + str(baths), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],        # cluster labels run 0..9, so index directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(toronto_map)
display(toronto_map)
Now we will go through each cluster (or you can click on the cluster points in the map above). In the tables below, the ranges matter more than the individual rows: you can read off the min and max price for each cluster, and the location range as well. Explore at will.
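As a side note, instead of calling `.describe()` once per cluster, the same per-cluster price summary can come from a single `groupby`. Sketched here on a toy frame with made-up values, reusing the column names from above:

```python
import pandas as pd

# Toy stand-in for df_copy (values are made up for illustration)
toy = pd.DataFrame({
    'Cluster': [0, 0, 1, 1, 1, 2],
    'Price':   [1200.0, 1400.0, 2500.0, 2700.0, 2600.0, 4000.0],
})

# one row per cluster with the count and price range
summary = toy.groupby('Cluster')['Price'].agg(['count', 'min', 'mean', 'max'])
print(summary)
```

On the real `df_copy` the same call would give the min/mean/max rent per cluster in one table.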
df_copy[df_copy['Cluster'] == 0].describe()
df_copy[df_copy['Cluster'] == 1].describe()
### Color: Red
df_copy[df_copy['Cluster'] == 2].describe()
df_copy[df_copy['Cluster'] == 3].describe()
df_copy[df_copy['Cluster'] == 4].describe()
df_copy[df_copy['Cluster'] == 5].describe()
df_copy[df_copy['Cluster'] == 6].describe()
df_copy[df_copy['Cluster'] == 7].describe()
df_copy[df_copy['Cluster'] == 8].describe()
df_copy[df_copy['Cluster'] == 9].describe()
Now we have analyzed the data and found groups of similar neighborhoods based on housing price features. Each cluster represents a set of neighborhoods.
After analyzing the clusters on the map, we can say that most of them are located in the high-density heart of Toronto. Some clusters are in the outer region, but they do have rail and subway networks keeping them connected to the main city. The final choice of house depends entirely on the person's work, the rent he/she can afford, how many people will live together, and whether the room will be shared or single.
Here we have built clusters based on the average price of a neighborhood; the final selection depends on the user's requirements and the price of the house, as well as the location of the job or school he/she is enrolled in.
This can be taken one step further by involving nearby venues in the clustering process to better separate closely lying neighborhoods. Also, if the user has a requirement like a school within 2 km, or a grocery store within walking distance of the house, that filtering should be applied after the clustering is complete. That would provide better final outcomes and an overall model that can fit any user's requirements.